This is the first post in a series where I'll show how I use pandas on real-world datasets.
For this post, we'll look at data I collected with Cyclemeter on my daily bike ride to and from school last year. I had to manually start and stop the tracking at the beginning and end of each ride. There may have been times when I forgot to do that, so we'll see if we can find those.
Let's begin in the usual fashion: a bunch of imports and loading our data.
In [1]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython import display
In [2]:
!ls data | head -n 5
Take a look at the first one to see how the file's laid out.
In [3]:
df = pd.read_csv('data/Cyclemeter-Cycle-20130801-0707.csv')
df.head()
Out[3]:
In [4]:
df.info()
Pandas has automatically parsed the headers, but it could use a bit of help on some dtypes.
We can see that the Time column holds datetimes, but it's been parsed as the object dtype.
This is pandas' fallback dtype that can store anything, but its operations won't be optimized like
they would be on a float, bool, or datetime64 column. read_csv takes a parse_dates parameter, which
we'll give a list of column names.
In [5]:
date_cols = ["Time", "Ride Time", "Stopped Time", "Pace", "Average Pace"]
df = pd.read_csv("data/Cyclemeter-Cycle-20130801-0707.csv",
parse_dates=date_cols)
display.display_html(df.head())
df.info()
One minor issue is that some of the dates are parsed as datetimes when they're really just times;
pandas stores everything as datetime64. We'll take care of that later. For now we'll keep them as
datetimes, and remember that they're really just times.
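For reference, the eventual fix will probably look something like this sketch, which uses the .dt accessor to pull out just the time-of-day component (Ride Time here stands in for any of the columns we passed to parse_dates):

# Sketch: extract the time component from a column that's really just a time.
# Note that .dt.time returns plain datetime.time objects, so we give up
# the fast datetime64 dtype in exchange for semantic correctness.
df['Ride Time'].dt.time.head()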
Now let's do the same thing, but for all the files.
Use a generator expression to filter down to just the csv's that match our naming convention. I try to use lazy generators instead of lists wherever possible. In this case the list is so small that it really doesn't matter, but it's a good habit.
In [6]:
import os
csvs = (f for f in os.listdir('data') if f.startswith('Cyclemeter')
        and f.endswith('.csv'))
I see a potential problem: we'll eventually want to concatenate the csv's together
into a single DataFrame, but we'll want to retain some idea of which specific
ride an observation came from. So let's create a ride_id variable, which will
just be an integer ranging from $0, \ldots, N - 1$, where $N$ is the number of rides.
Make a simple helper function to do this, and apply it to each csv.
In [7]:
def read_ride(path_, i):
    """
    Read in the csv at `path_`, and assign the `ride_id` variable to `i`.
    """
    date_cols = ["Time", "Ride Time", "Stopped Time", "Pace", "Average Pace"]
    df = pd.read_csv(path_, parse_dates=date_cols)
    df['ride_id'] = i
    return df

dfs = (read_ride(os.path.join('data', csv), i)
       for (i, csv) in enumerate(csvs))
Now concatenate them together. The original indices are meaningless, so we'll ignore them in the concat.
In [8]:
df = pd.concat(dfs, ignore_index=True)
df.head()
Out[8]:
Great! The data itself is clean enough that we didn't have to do too much munging.
Let's persist the merged DataFrame. Writing it out to a csv would be fine, but I like to use
pandas' HDF5 integration (via PyTables) for personal projects.
In [9]:
df.to_hdf('data/cycle_store.h5', key='merged',
          format='table')
I used the table format in case we want to do some querying on the HDFStore itself, but we'll save that for next time.
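Just as a taste, a table-format store supports queries when reading back. Here's a sketch, with the caveat that querying on a specific column requires it to have been written as a data column (e.g. passing data_columns=['ride_id'] to to_hdf above):

# Sketch: read back just one ride, without loading the whole table into memory.
# Assumes 'ride_id' was stored as a data column (data_columns=['ride_id']).
first_ride = pd.read_hdf('data/cycle_store.h5', 'merged', where='ride_id == 0')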
That's it for this post. Next time, we'll do some exploratory analysis of the data.